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Abstract 

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide poly- 
morphism (SNP) data, for detecting population structure and potential outliers. However, the size of 
SNP datasets has increased immensely in recent years and PCA of large datasets has become a time con- 
suming task. We have developed f lashpca, a highly efficient PCA implementation based on randomized 
algorithms, which delivers identical accuracy in extracting the top principal components compared with 
existing tools, in substantially less time. We demonstrate the utility of f lashpca on both HapMap3 and 
on a large Immunochip dataset. For the latter, f lashpca performed PCA of 15,000 individuals up to 125 
times faster than existing tools, with identical results, and PCA of 150,000 individuals using f lashpca 
completed in 4 hours. The increasing size of SNP datasets will make tools such as f lashpca essential as 
traditional approaches will not adequately scale. This approach will also help to scale other applications 
that leverage PCA or eigen-decomposition to substantially larger datasets. 

Introduction 

Principal component analysis (PCA) is a widely-used tool in genomics and statistical genetics, employed 
to infer cryptic population structure from genome-wide data such as single nucleotide polymorphisms 
(SNPs) [l][2], and/or to identify outlier individuals which may need to be removed prior to further 
analyses, such as genome- wide association studies (GWAS). This is based on the fact that such population 
structure can confound SNP-phenotype associations, resulting in some SNPs spuriously being called as 
associated with the phenotype (false positives). The top principal components (PCs) of SNP data have 
been shown to map well to geographic distances between human populations [l][3], thus capturing the 
coarse-grain allelic variation between these groups. 

However, traditional approaches to computing the PCA, such as those employed by the popular 
EIGENSOFT suite II], are computationally expensive. For example, PCA based on the singular value 
decomposition (SVD) scales as C(min (N 2 p, Np 2 )) and for eigen-decomposition it is C(min (N 3 ,p 3 )) 
(excluding the cost of computing the covariance matrix itself which is also C(min (N 2 p, Np 2 ))), where N 
and p are the number of samples and SNPs, respectively. This makes it time-consuming to perform PCA 
on large cohorts such as those routinely being analyzed today, involving millions of assayed or imputed 
SNPs and tens of thousands of individuals, with this difficulty only likely to increase in the future with 
the availability of even larger studies. 

In recent years, research into randomized matrix algorithms has yielded alternative approaches for 
performing PCA and producing these top PCs, while being far more computationally tractable and 
maintaining high accuracy relative to the traditional "exact" algorithms [4j[5]. These algorithms are 
especially useful when we are interested in finding only the first few principal components (PCs) of the 
data, as is often the case in genomic analysis. 

Here we present f lashpca, an efficient tool for performing PCA on large genome- wide data, based 
on randomized algorithms. Our approach is highly efficient, allowing the user to perform PCA on large 
datasets (100,000 individuals or more), extracting the top principal components while achieving identical 
results to traditional methods. 
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Results 

First we used an LD-pruned HapMap3 genotype data [6J consisting of 957 human individuals across 11 
populations assayed for 14,389 SNPs (Materials). We compared f lashpca with smartpca from EIGEN- 
SOFT v4.2 [j] and shellfisrj^J In addition, we included the R 3.0.2 prcomp function which is based 
on SVD rather than eigen-decomposition, after replacing its original standardization with the one used 
by smartpca (Equation |4| . The analysis of HapMap3 data revealed the expected ancestry via the first 
two PCs (Figure la), with individuals of east Asian origin (CHB, CHD, JPT) clustered in the bottom 
right-hand side corner, those of European origin (TSI, CEU) in the top, and those of African origin in 
the bottom left-hand side corner (ASW, LWK, YRI, and MKK). All methods showed close to perfect 
agreement on the top PC (Figure lb), with an absolute correlation of 1.00 between the 1st PC of each 
pair of methods (the sign of the eigenvectors is arbitrary hence the correlation may be negative as well) . 
The next nine PCs showed close to perfect agreement as well (see Supplementary File 1 and the online 
documentation for results for PCs 1-10). 

Next, we analyzed an LD-pruned celiac disease Immunochip dataset consisting of 16,002 individuals 
and 43,049 SNPs after thinning by LD 17] (Materials). We then randomly sampled subsets of the original 
dataset with increasing size (N = 500, 1000, 2500, 5000, 7500, 10,000, 15,000), and recorded wall time 
for f lashpca, smartpca, and shellfish performing PCA on these subsets. We also examined larger 
setups (N = 50,000, 100,000, and 150,000) by duplicating the original dataset several times as required. 

Due to the substantial time required by shellfish and smartpca to complete the largest runs, 
we only ran f lashpca on the larger datasets (N > 50,000) (we attempted to run shellfish on the 
N = 50,000 dataset but it did not complete due to running out of memory, and we stopped smartpca 
after 100,000sec). 

Each experiment was repeated three times. All programs used multi-threaded mode with 8 threads. 
All experiments were run on a machine with 4 x 10-core Intel Xeon E7-4850 CPU @ 2.00GHz with 512GiB 
RAM running 64-bit Ubuntu Linux 12.04. Note that wall time here is defined as time from program start 
to successful exit, inclusive of any loading of data, scaling, computation of the covariance matrix, eigen- 
decomposition, and so on, however, the majority of the run time is taken by computing the covariance 
and the eigen-decomposition. smartpca was run without excluding potential population outliers. 

Figure 2 shows that f lashpca was substantially faster than either smartpca or shellfish: for analysis 
of 15,000 samples, f lashpca took an average of 8 minutes, whereas smartpca required an average of 
almost 17h (xl25 slower). Examining the large datasets, f lashpca was able to analyze a dataset of N — 
150,000 in ^4h, which would not be sufficient time for smartpca to complete a PCA on 10,000 samples, 
as 6.5h were required for that. While shellfish was also substantially faster than smartpca, it was still 
considerably slower than f lashpca when the number of individuals was large, with a run time of ~ lh 
for N = 15,000 (and did not complete for subsets with N > 50,000). 

Importantly, the time taken by f lashpca to compute PCA on N = 150,000 did not differ much 
from the time taken on the N = 100,000 subset; this is because f lashpca automatically transposes the 
data when N > p, as the PCA can be computed on the original data or its transpose with only minor 
modifications to the algorithm; computing the p x p covariance matrix and its eigen-decomposition has 
lower computational complexity than using the N x N covariance matrix, for the same values of N and p, 
hence when N > p the main computational cost will not grow substantially with N. 



Discussion 

Principal component analysis is an important tool in genomics for discovery of population structure 
or other latent structure in the data, such as batch effects. Early approaches such as smartpca from 

J http : //www .hsph. harvard. edu/alkes-price/software/ 

^http : //www . stats . ox. ac . uk/~davison/sof tware/shellf ish/shellf ish .php 



Downloaded from http://biorxiv.org/on September 18, 2014 



3 



EIGENSOFT have proven useful for this goal and have been widely used for analysis of SNP datasets. 
However, many current datasets assay tens of thousands of individuals, making traditional approaches 
extremely time consuming. In contrast, our approach, f lashpca, is based on careful combination of 
randomized algorithms for PCA together with parallelization, and allows the analyst to easily perform 
PCA on large datasets consisting of many thousands of individuals in a matter of minutes to hours. 
Despite relying on an approximation strategy, this approach suffers from essentially no loss in accuracy 
for the top eigenvalues/eigenvectors compared with traditional approaches. One practical limitation of 
the current implementation of f lashpca is its memory requirements for large datasets: using 15,000 indi- 
viduals with 43K SNPs requires — 14GiB RAM, and 150,000 requires ~145GiB. Future work will involve 
reducing these memory requirements without incurring a substantial performance penalty. Randomized 
PCA also provides the potential to de-correlate samples ("whitening"), thus essentially removing the 
effects of population from data prior to further downstream association analysis [8] ; this will be examined 
in future work. It has been shown that standard PCA may be an inconsistent estimator of the true 
principal components in certain high dimensional settings and there may be benefit from utilizing 
other approaches such as sparse (-^-penalized) PCA lOpl] ; however, standard PCA remains widely used 



in practice and f lashpca provides an effective way to perform such routine analyses highly efficiently. 
More generally, the approach behind f lashpca could prove useful for accelerating other methods that 



depend on performing a large number of eigen-decompositions across many samples, such as varLD 12 



which assesses local differences in SNP LD between populations or FaST-LMM which implements linear 



mixed models of SNP data 13 



Materials and Methods 

Ethics 

All subjects included in the celiac disease dataset provided written and informed consent. For details, 
see the original publication [7j. 

Datasets 
HapMap3 

The HapMap phase 3 dataset consists of 1184 human individuals across 11 populations (ASW: African 
ancestry in Southwest USA; CEU: Utah residents with Northern and Western European ancestry from 
the CEPH collection; CHB: Han Chinese in Beijing, China; CHD: Chinese in Metropolitan Denver, 
Colorado; GIH: Gujarati Indians in Houston, Texas; JPT: Japanese in Tokyo, Japan; LWK: Luhya in 
Webuye, Kenya; MEX: Mexican ancestry in Los Angeles, California; MKK: Maasai in Kinyawa, Kenya; 
TSI: Toscani in Italia; YRI: Yoruba in Ibadan, Nigeria) assayed for 1,440,616 SNPs ml. We performed 
QC on the data, including removal of SNPs with MAF< 1%, missingness > 1%, and deviation from 
Hardy- Weinberg equilibrium P < 5 x 10~ 6 . We removed non- founders and individuals with genotyping 
missingness > 1%, leaving 957 individuals. Next, we removed several regions of high LD and/or known 
inversions (chr5: 44Mb-51.5Mb, chr6: 25Mb-33.5Mb, chr8: 8Mb-12Mb, dull: 45Mb-57Mb) 111. Fi- 



nally, we used PLINK [15] — indep-pairwise 1000 10 0.02 to thin the SNPs by LD (r 2 < 0.02), 
leaving 14,389 SNPs. 

Celiac Disease Immunochip 

The celiac disease Immunochip dataset [7] consists of 16,002 case/control individuals of British descent, 
assayed for 139,553 SNPs. The QC has been previously described \7J. In addition, we removed SNPs with 
MAF < 0.5% and non-autosomal SNPs, leaving 115,746 SNPs. Next, we removed the same four regions 
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as for the HapMap3 data, and finally, we thinned the SNPs by LD with PLINK — indep-pairwise 50 
10 0.5, leaving 43,049 SNPs. 

Principal Component Analysis 

We represent the genotypes as an N x p matrix X, where the N samples index the rows and the p SNPs 
index the columns. We denote the transpose of the matrix as X T . For the following, we assume that the 
matrix X has been centered so that the mean of each column j is zero (see below for variants on this) . 

PC A relies on finding the eigenvectors of the NxN covariance matrix X = XX T (for notational clarity 
we will ignore the scaling N 1 _ 1 factor which can be accounted for later). This decomposition is performed 
using either the singular value decomposition (SVD) of the matrix X or the eigen-decomposition of the 
covariance matrix itself. 

The SVD of X is 

X = UDV T , (1) 

where U an is an TV x k matrix (U T U = I) the columns of which are the eigenvectors of XX T , D is a 
k x k diagonal matrix of singular values (square root of the eigenvalues of XX T and X T X), and V is 
a p x k matrix (V T V = I) of the eigenvectors of X T X, where k is the matrix rank (this SVD is also 
called the "economy SVD"). Note that SVD does not require the covariance matrix £ to be computed 
explicitly. 

In the eigen-decomposition approach, the covariance matrix S is first explicitly computed, then the 
eigen-decomposition is performed such that 

£ = UAU T , (2) 

where diag(A) = X%, . . . ,\k — diag(D 2 ) are the eigenvalues and U is the matrix of eigenvectors as before. 

The principal components (PCs) P of the data are given by the projection of the data onto the 
eigenvectors 

P = XV = UD, (3) 

where we usually truncate the matrix P to have as many columns as required for any down-stream 
analysis (say, 10). 

Note that some tools, such as smartpca and shellfish, output the eigenvectors U as the principal 
components without weighting by the singular values D, leading to different scales for the PCs. In 
addition, since the (empirical) covariance is typically scaled by a factor of N 1 _ 1 , then in order to maintain 
the interpretation of the singular values D as the square-root of the eigen- values of the scaled covariance 
^3jXX T , the singular values must be scaled by a factor of = as well (as implemented in R's prcomp). 
Note, however, that these scale differences have no effect on the interpretation of the principal components 
for ascertaining or correcting for potential population structure in data. 

In traditional PC A, such as that implemented in R's prcomp, prior to running the SVD/eigen- 
decomposition itself, the matrix X is first mean-centered by subtracting the per-column (SNP) aver- 
age from each column. In contrast, smartpca II], first centers the data, then divides by a quantity 
proportional to the standard deviation of a binomial random variable with success probability pj 



(4) 



where pj = pj/2 and pj — i? X)t=i Xy. To maintain compatibility with smartpca, f lashpca employs 
the same standardization method by default (other scaling methods are available as well). 
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Algorithm 1: Pseudocode for the eigen-decomposition variant of the fast PCA, based on the 
randomized algorithm of [5j for the case where N < p. standardize(-) is the standardization in 
Equation [4] randn(m, n) is a function generating an m x n iid multivariate normal matrix, e 
is the user-defined number of extra dimensions, qr(-) is the QR decomposition, normalize(-) is a 

function that divides each column j by its £2 norm ^Sti Xfj, eigen(-) is the eigen-decomposition 

producing the d-top eigenvectors and vector of d-top eigenvalues A<j. is the matrix of d 
principal components. 

X' = standardize(X) 
R = randn(p, d + e) 
Y = normalize(X'R) 
£ = X'X' T 

for iter = ltmaxiter do 
I Y = normalize(SY) 
end 

[Q,R]=qr(Y) 
B = Q T X' 
S = BB T 

[TJ d , \ d ] = eigen(S) 
Ud = QUd 
Dd = diag(y^) 
P d = U d D d 



Fast Principal Component Analysis 

Performing PCA on large matrices (with both N and p large), is time consuming with traditional ap- 
proaches. To enable fast PCA, we employ an algorithm based on a randomized PCA approach [5j. Briefly, 
randomized PCA relies on first constructing a relatively small matrix that captures the top eigenvalues 
and eigenvectors of the original data, with high probability. Next, standard SVD or eigen-decomposition 
is performed on this reduced matrix, producing near identical results to what would have been achieved 
using a full analysis of the original data. Since in most genomic applications we are interested only in a 
few of the top eigenvectors of the data (typically 10), this allows reduction of the data to a substantially 
smaller matrix, and the computational cost of decomposing this matrix is negligible (see Algorithm [TJ . 
(Note, however, that this method is general and is just as useful for extracting a much larger number of 
PCs). 

Focusing on the eigen-decomposition approach (Equation [2]), the two main computational bottlenecks 
are (i) computing the N x N covariance matrix S and (ii) when N is large, the eigen-decomposition step 
itself. In our fast PCA approach, the first bottleneck cannot be avoided but can be mitigated through 
parallel computation (see Implementation). The second bottleneck is circumvented via the randomized 
approach, by constructing a matrix B of size (d + e) x p, from which the small (d + e) x (d + e) matrix 
BB T is formed and used for the eigen-decomposition, where d is the number of required eigenvectors 
(say 10), and e is the number of auxiliary dimensions used to increase accuracy which can be discarded 
later. We have found that a total dimension of d + e = 200 is more than sufficient for producing good 
results in the first 10 PCs; hence the eigen-decomposition need only be performed on 200 x 200 matrix 
while producing near identical results to a full PCA on the original data. 

Another computational shortcut for the case where N > p, applying equally to PCA and to ran- 
domized PCA, is simply transposing the data X, then standardizing the rows instead of the columns. 
An identical PCA algorithm is then run on the transposed data, with the only difference being that the 
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estimated matrix will now contain the top-<i eigenvectors of X' T X' instead of X'X' T , and hence the 
final d principal components will be = X'U^. This procedure makes it possible to analyze large 
datasets with N ^> p at a cost not much greater than when N = p. 

While the SVD approach is generally recommended for reasons of better numerical stability and 
speed, we have found that when N and p are in the thousands, the above eigen-decomposition approach 
is substantially faster since the matrix multiplication is trivially parallelizable (and is parallelized in 
practice in f lashpca), allowing the decomposition to be performed on the small matrix S = BB T which 
is of size (d + e) x (d + e) rather than a more expensive non-parallelized SVD of the matrix B, which is 
an N x (d + e) matrix, with no discernible effect on accuracy of the top principal components; however, 
both methods are implemented in f lashpca (yet another possibility is to perform SVD on S, but this is 
not implemented yet). 

Implementation 

f lashpca is implemented in C++ and relies on Eigen [16] , a CH — h header-only library of numerical linear 
algebra algorithms, which allows for native parallelization of certain computations through OpenMP 
threads when multiple CPU cores are available, f lashpca natively reads PLINK [15] SNP-major BED 
files, avoiding the need to convert these files to other formats, f lashpca is licensed under the GNU Public 
License (GPL) v3; source code and documentation are available at https://github.com/gabraham/ 
f lashpca 

Prior to PCA, thinning of the SNPs by LD and removal of regions with known high LD or other 
artefacts such as inversions have been recommended [l][l4], as high correlations between the SNPs can 
distort the resulting eigenvectors. For this purpose we recommend using PLINK v2^]which is substantially 
faster than PLINK vl.07. Reducing a dataset to ~ 10,000-50,000 SNPs is usually sufficient to achieve 
an accurate PCA, and can be done using — indep-pairwise. 
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Figure Legends 

Figure 1: (a) The first two principal components from analyzing the HapMap3 dataset. (b) Scatter 
plots showing near-perfect absolute Pearson correlation (lower left-hand corner) between the 1st PCs esti- 
mated by smartpa, f lashpca, shellfish, and R's prcomp (using the standardization from Equation |4| . 
Note that since eigenvectors are only defined up to sign, the correlations may be negative as well as 
positive. In addition, the scale of the PCs may differ between the methods, however, this has no bearing 
on the interpretation of the PCs. 

Figure 2: Total wall time (seconds) for f lashpca versus EIGENSOFT's smartpca and shellfish on 
increasing subsets of the celiac disease dataset, employing multi-threaded mode (8 threads), using 43,049 
SNPs. shellfish did not complete PCA for the N > 50,000 subsets, and smartpca was stopped 
after 100,000sec. The results shown are averages over 3 runs. Results for N < 15,000 are based on 
subsamples of the original dataset N = 16,002 (light blue background), whereas results for N > 50,000 
are based on duplicating the original samples (light yellow background). 
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Supporting Information 

File SI: Concordance in principal components 1-10 between smartpca, f lashpca, shellfish, and R's 
prcomp on the HapMap3 dataset. 



